@anamikac-intel anamikac-intel commented Sep 29, 2025

Modify 00_bmg_gemm to include new mma and copy atoms (#477).
00_bmg_gemm combines two parts: mma and epilogue. Both currently use the old atoms, so both need to be updated to adopt the new ones. As a first step, we will:

Keep CollectiveEpilogue unchanged for now
Only modify CollectiveMma first

Old Atom:

Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [96.448]TFlop/s (1.7813)ms

New Atom:

Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms

@anamikac-intel anamikac-intel marked this pull request as ready for review September 29, 2025 08:11
@anamikac-intel anamikac-intel changed the title Use newer version on mma_atom and copy_atom in 00_bmg_gemm Use newer version of mma_atom and copy_atom in 00_bmg_gemm Sep 30, 2025
…d_copy_*, and move tensor/copy initialization to host-side params in to_underlying_arguments

@petercad petercad left a comment

Approving with the minor changes suggested above.

Edit -- there is a bug in the TiledCopy handling that needs fixing, described below.

@sanchitintel

Hi @anamikac-intel, with this PR, I'm encountering the same errors locally as the CI.
Are you using a more recent igc version at your end?
I'm using https://github.com/intel/intel-graphics-compiler/releases/tag/v2.18.5.

Thanks!

@petercad

> With New Atom perf increase by 2x

Theoretical bf16 peak perf for BMG is 116 TF/s, so the new performance is too high. Either there's a problem in the kernel (not doing the full computation) or something's wrong with the performance computation.
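
A rough sanity check along these lines can be sketched as follows. It assumes the conventional 2*M*N*K*L FLOP count for a batched GEMM and reads the reported problem size as M x N x K x L; the helper names (`gemm_flops`, `achieved_tflops`, `is_plausible`) are hypothetical and not part of the example or of CUTLASS.

```cpp
#include <cstdint>

// Assumption: a batched GEMM performs K multiply-adds for each of the
// M*N*L outputs, counted as 2 floating-point operations each.
constexpr double gemm_flops(std::int64_t m, std::int64_t n,
                            std::int64_t k, std::int64_t l) {
  return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
         static_cast<double>(k) * static_cast<double>(l);
}

// Achieved throughput in TFLOP/s for a runtime given in milliseconds.
constexpr double achieved_tflops(double flops, double runtime_ms) {
  return flops / (runtime_ms * 1e-3) / 1e12;
}

// A measured rate above the theoretical peak indicates a problem in the
// kernel or in the performance computation.
constexpr bool is_plausible(double tflops, double peak_tflops) {
  return tflops > 0.0 && tflops <= peak_tflops;
}

// Theoretical bf16 peak for BMG quoted in this review comment.
constexpr double kBmgBf16PeakTflops = 116.0;

// Problem size reported in the PR description: 5120x4096x4096x1.
constexpr double kReportedFlops = gemm_flops(5120, 4096, 4096, 1);
```

With the timings listed in the PR description (1.7813 ms and 1.7664 ms), `achieved_tflops(kReportedFlops, ...)` gives roughly 96.4 and 97.3 TFLOP/s, both below the quoted peak. Any figure near twice the legacy number would fail `is_plausible`.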

@sanchitintel

This comment was marked as outdated.

@tdeng5

tdeng5 commented Oct 16, 2025

We checked the performance of several shapes:
[image: performance results]

tdeng5 pushed a commit that referenced this pull request Oct 17, 2025
Fixes a compilation failure found in #540 when >2D tensors are passed to
one of the `make_block_2d_copy_*` functions.
@Antonyvance Antonyvance added the urgent PR requires a urgent attention (for release or blocking another PR) label Oct 17, 2025
@Antonyvance Antonyvance added this to the 0.6 milestone Oct 17, 2025
using ArchTag = typename DispatchPolicy::ArchTag;

- static_assert(platform::is_same<ElementA, ElementB>::value, "MainloopIntelXeXMX16 requires that A and B have same type.");
+ static_assert(platform::is_same<ElementA, ElementB>::value, "MainloopXeL1Staged requires that A and B have same type.");

@sanchitintel sanchitintel Oct 18, 2025

In the existing MMA collective code, we use variable names ATOM_M, ATOM_N, ATOM_K incorrectly, because they don't correspond to the underlying MMA atom, but to our tiling scheme instead.

  static constexpr int ATOM_M = get<1>(typename TiledMma::ThrLayoutVMNK{}.shape());
  static constexpr int ATOM_N = get<2>(typename TiledMma::ThrLayoutVMNK{}.shape());
  static constexpr int ATOM_K = get<3>(typename TiledMma::ThrLayoutVMNK{}.shape());

Workgroup tiles are divided spatially into sub-group fragments/tiles.

For example, the variable ATOM_M is actually the number of subgroup tiles along M that make up a workgroup tile. That is, ATOM_M equals WG_M/SG_M and does not represent the atom's M dimension.

Can we rename these variables in this PR? It isn't needed for correctness, but it would make the code easier to understand.
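
To make the distinction concrete, here is a minimal sketch with made-up tile sizes. All constants and the `SG_TILES_*` names are hypothetical, chosen only for illustration, and are not taken from the PR or from the collective code.

```cpp
// Hypothetical tile sizes, for illustration only.
constexpr int WG_M = 256, WG_N = 256;  // workgroup tile extents
constexpr int SG_M = 64,  SG_N = 32;   // subgroup tile extents
constexpr int MMA_ATOM_M = 8;          // hypothetical M extent of one MMA atom

static_assert(WG_M % SG_M == 0 && WG_N % SG_N == 0,
              "workgroup tile must divide evenly into subgroup tiles");

// What the collective currently calls ATOM_M / ATOM_N: the number of
// subgroup tiles per workgroup tile along each dimension.
constexpr int SG_TILES_M = WG_M / SG_M;
constexpr int SG_TILES_N = WG_N / SG_N;

// Total subgroups cooperating on one workgroup tile.
constexpr int SUBGROUPS_PER_WG = SG_TILES_M * SG_TILES_N;

// The subgroup-tile count is unrelated to the atom's own M extent.
static_assert(SG_TILES_M != MMA_ATOM_M,
              "tile count and atom shape are different quantities");
```

A name along the lines of `SG_TILES_M` says that the value counts tiles, whereas `ATOM_M` suggests a property of the MMA atom itself.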

Thanks!

I agree with you, but we should fix it in a separate PR: a new feature in our latest release depends heavily on this PR, and we expect it to be merged ASAP.

@anamikac-intel
anamikac-intel commented Oct 19, 2025

Performance results: new vs legacy implementation on different problem sizes (Tested on IGC 2.20)

[image: performance results, new vs legacy implementation, across problem sizes]
